To read the files, you can scan all the subfolders of your base path, looking for files named body.txt or timestamp.txt. You can use os.walk: it recursively traverses the tree rooted at the base path and, at each iteration, yields the path of the current folder, a list with the names of the folders it contains, and a list of the files in that folder. You can then load each timestamp.txt and body.txt into a pandas DataFrame and concatenate them. I assume that both files have one sample per line, something like this:
timestamp.txt
1609004661.419179
1609004662.419179
1609004663.419179
1609004664.419179
body.txt
b
g
b
g
j
As you can see, timestamps are assumed to be in UNIX timestamp format (seconds since the epoch).
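To make the os.walk description concrete, here is a minimal sketch that only lists the folders containing both files. The helper name find_sample_folders and the throwaway directory tree are mine, purely for illustration.

```python
import os
import tempfile

def find_sample_folders(basepath, names=('body.txt', 'timestamp.txt')):
    """Return the folders under basepath that contain every file in names."""
    found = []
    # os.walk yields (current folder, its subfolder names, its file names)
    for parent_dir, folders, files in os.walk(basepath):
        if all(name in files for name in names):
            found.append(parent_dir)
    return sorted(found)

# Build a tiny throwaway tree: a/b has both files, c has only body.txt
with tempfile.TemporaryDirectory() as root:
    complete = os.path.join(root, 'a', 'b')
    partial = os.path.join(root, 'c')
    os.makedirs(complete)
    os.makedirs(partial)
    for name in ('body.txt', 'timestamp.txt'):
        open(os.path.join(complete, name), 'w').close()
    open(os.path.join(partial, 'body.txt'), 'w').close()
    result = [os.path.relpath(p, root) for p in find_sample_folders(root)]
    print(result)
```

Only the folder holding both files is reported; the folder with just body.txt is skipped.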
Here is an example of the code:
import os
import pandas as pd

BASEPATH = 'your base path'
BODY_NAME = 'body.txt'
TIMESTAMP_NAME = 'timestamp.txt'

bds = []  # list to collect data in 'body.txt' files
ts = []   # list to collect data in 'timestamp.txt' files
for parent_dir, folders, files in os.walk(BASEPATH):
    print(parent_dir, folders, files)
    try:
        # read both files before appending, so the two lists stay aligned
        body = pd.read_csv(os.path.join(parent_dir, BODY_NAME), header=None, names=['body'])
        timestamp = pd.read_csv(
            os.path.join(parent_dir, TIMESTAMP_NAME), header=None, names=['timestamp'],
            converters={'timestamp': lambda x: pd.to_datetime(float(x), unit='s')}
        )
    except FileNotFoundError:
        continue  # skip folders without one of the two files
    bds.append(body)
    ts.append(timestamp)

data = pd.concat((pd.concat(ts, axis=0), pd.concat(bds, axis=0)), axis=1).sort_values('timestamp').reset_index(drop=True)
After running the previous code, data looks something like this:
timestamp body
0 2020-12-26 17:44:21.419178963 b
1 2020-12-26 17:44:21.419178963 c
2 2020-12-26 17:44:21.421484947 f
3 2020-12-26 17:44:21.421484947 e
4 2020-12-26 17:44:21.421484947 j
5 2020-12-26 17:44:22.419178963 f
6 2020-12-26 17:44:22.419178963 g
7 2020-12-26 17:44:22.421484947 f
8 2020-12-26 17:44:22.421484947 f
9 2020-12-26 17:44:22.421484947 i
10 2020-12-26 17:44:23.419178963 k
11 2020-12-26 17:44:23.419178963 b
12 2020-12-26 17:44:23.421484947 b
13 2020-12-26 17:44:23.421484947 d
14 2020-12-26 17:44:23.421484947 i
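The converter in the code above parses the timestamps one value at a time. As an alternative sketch (not what I used to produce the output above), you could read the column as plain floats and convert it in one call with pd.to_datetime. The in-memory file below stands in for a real timestamp.txt.

```python
import io
import pandas as pd

# Stand-in for a timestamp.txt file with one UNIX timestamp per line
raw = io.StringIO("1609004661.419179\n1609004662.419179\n")

ts = pd.read_csv(raw, header=None, names=['timestamp'])
# Convert the whole column at once; unit='s' means seconds since the epoch
ts['timestamp'] = pd.to_datetime(ts['timestamp'], unit='s')
print(ts)
```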
For the second part of your question, you can collect the data of each two-second interval in a separate DataFrame using resample (after making timestamp the index of the DataFrame):
data.set_index('timestamp', drop=True, inplace=True)
# origin='start' makes the first interval begin at the first timestamp
resampled = data.resample('2s', origin='start')
df_2sec = {}
for i, (timestamp, df) in enumerate(resampled):
    df_2sec[i] = df
Here are the first two of these DataFrames (df_2sec[0] and df_2sec[1]):
body
timestamp
2020-12-26 17:44:21.419178963 b
2020-12-26 17:44:21.419178963 c
2020-12-26 17:44:21.421484947 f
2020-12-26 17:44:21.421484947 e
2020-12-26 17:44:21.421484947 j
2020-12-26 17:44:22.419178963 f
2020-12-26 17:44:22.419178963 g
2020-12-26 17:44:22.421484947 f
2020-12-26 17:44:22.421484947 f
2020-12-26 17:44:22.421484947 i
body
timestamp
2020-12-26 17:44:23.419178963 k
2020-12-26 17:44:23.419178963 b
2020-12-26 17:44:23.421484947 b
2020-12-26 17:44:23.421484947 d
2020-12-26 17:44:23.421484947 i
2020-12-26 17:44:24.419178963 g
2020-12-26 17:44:24.419178963 d
2020-12-26 17:44:24.421484947 i
2020-12-26 17:44:24.421484947 e
2020-12-26 17:44:24.421484947 i
You can change the starting timestamp of the resample (I chose the first timestamp in the DataFrame, using the 'start' value for the origin argument): take a look at the documentation for more info.
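To show what the origin argument changes, here is a small sketch on made-up data (the four timestamps and bodies below are invented for the demo). With the default origin the bins are aligned to midnight, so they fall on even seconds; with origin='start' the first bin begins at the first timestamp.

```python
import pandas as pd

idx = pd.to_datetime(['2020-12-26 17:44:21.4', '2020-12-26 17:44:22.4',
                      '2020-12-26 17:44:23.4', '2020-12-26 17:44:24.4'])
demo = pd.DataFrame({'body': list('bfkg')}, index=idx)

# Default origin ('start_day'): bins start at 17:44:20, 17:44:22, 17:44:24
default_sizes = [len(df) for _, df in demo.resample('2s')]
# origin='start': bins start at 17:44:21.4 and 17:44:23.4
start_sizes = [len(df) for _, df in demo.resample('2s', origin='start')]

print(default_sizes)  # [1, 2, 1]
print(start_sizes)    # [2, 2]
```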
If you prefer, you can convert each DataFrame to a dict, changing df_2sec[i] = df to df_2sec[i] = df.to_dict().
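One caveat if you go the dict route: the default to_dict() keys each column by the index, and the data above contains repeated timestamps, so rows sharing a timestamp would overwrite each other. A possible workaround (my suggestion, on a two-row toy frame) is to move the index back to a column and export one dict per row:

```python
import pandas as pd

t = pd.Timestamp('2020-12-26 17:44:21')
df = pd.DataFrame({'body': ['b', 'c']},
                  index=pd.Index([t, t], name='timestamp'))

# Keyed by timestamp: the two rows share a key, so only the last one survives
as_dict = df.to_dict()
# One dict per row: nothing is lost
as_records = df.reset_index().to_dict('records')

print(len(as_dict['body']))  # 1
print(len(as_records))       # 2
```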
NOTE:
I used the following code to create the structure of your folder, in case you want to reproduce data from my answer.
import os
import numpy as np
import datetime
import pandas as pd

BASEPATH = 'your base path'
BODY_NAME = 'body.txt'
TIMESTAMP_NAME = 'timestamp.txt'
BODIES = 'abcdefghijk'
FOLDER = 'abcdefghijklmnopqrstuvwxyz'

def create_rnd_folder(basepath, level=2, timestamp=None):
    """Generate a random two-level directory tree with sample files in the leaves"""
    if level == 0:
        # leaf folder: write n_samples timestamps (one second apart) and bodies
        n_samples = np.random.randint(1, 100)
        with open(os.path.join(basepath, TIMESTAMP_NAME), 'w') as f:
            for i in range(n_samples):
                f.write(f'{timestamp + i}\n')
        with open(os.path.join(basepath, BODY_NAME), 'w') as f:
            for s in np.random.choice(list(BODIES), size=n_samples):
                f.write(f'{s}\n')
    else:
        new_folders = [np.random.choice(list(FOLDER), size=3 + level) for _ in range(np.random.randint(2, 4))]
        for nf in map(''.join, new_folders):
            new_folder = os.path.join(basepath, nf)
            os.makedirs(new_folder, exist_ok=True)
            # the starting timestamp is chosen at the top level and passed down
            create_rnd_folder(
                new_folder, level=level - 1,
                timestamp=(datetime.datetime.utcnow() - datetime.timedelta(days=30)).timestamp() if level == 2 else timestamp
            )

os.makedirs(BASEPATH, exist_ok=False)
create_rnd_folder(BASEPATH, level=2, timestamp=None)