Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Create dict from txt files in multiple directories

I have a series of directories in a structure somewhat like:

E.G.

basepath>abcd
basepath>abcd>aba
basepath>abcd>aba>abb
basepath>abcd>aba
basepath>abcd>abd
basepath>abcd>abd>add

Other than basepath, directory names are randomly generated strings. There are a few hundred different directories.

Within every directory I have 2 files, 'body.txt' and 'timestamp.txt'. As an end goal I'd like to collect every body that falls within a common range of time, for example by splitting the bodies into 1 hour intervals. The data in timestamp.txt is an integer number of seconds.

I'm imagining the first challenge will be to get a list of every directory and subdirectory. Can anyone suggest what I can use to get a list of subdirectories under basepath?

Then I need a way to sort and organise this data. I know that Pandas can split data by date, and that is possibly the best option I know of. If anyone can suggest a different method, I'd love to hear it.

As an example of how I'd organise + split the data:

Timestamp(s)        Body
300                  a
301                  b
304                  c
306                  d
301                  e
304                  f
301                  g
306                  h
308                  i
307                  j

Split as an interval of 2 secs

Timestamp(s)        Body
300                  a
301                  b
301                  e
301                  g

304                  c
304                  f

306                  d
306                  h

307                  j
308                  i
question from:https://stackoverflow.com/questions/65889730/create-dict-from-txt-files-in-multiple-directories


1 Answer


To read the files, scan every subfolder under your base path for files named body.txt and timestamp.txt. os.walk does this: it recursively traverses the tree rooted at the base path and, at each step, yields the path of the current folder, a list of the subfolder names it contains, and a list of the file names in it. You can then load each timestamp.txt and body.txt into a pandas DataFrame and concatenate them. I assume that both files hold one sample per line, something like this:

timestamp.txt

1609004661.419179
1609004662.419179
1609004663.419179
1609004664.419179

body.txt

b
g
b
g
j

As you can see, the timestamps are assumed to be UNIX timestamps, i.e. seconds since 1970-01-01 UTC.
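
As a quick check of that assumption, here is a minimal sketch of how pandas turns such values into datetimes. The two numbers are taken from the timestamp.txt sample above; the variable names are only illustrative. If your files hold plain integers of seconds, the same call applies.

```python
import pandas as pd

# Two sample values copied from the timestamp.txt example above.
raw = pd.Series([1609004661.419179, 1609004662.419179])

# unit='s' tells pandas the numbers are seconds since 1970-01-01 UTC.
converted = pd.to_datetime(raw, unit='s')

# Flooring to the second hides float rounding in the fractional part.
print(converted.dt.floor('s').tolist())
```

The first value should come out as 2020-12-26 17:44:21, which matches the timestamps shown in the output of this answer.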

Here is an example of the code:

import os
import pandas as pd

BASEPATH = 'your base path'
BODY_NAME = 'body.txt'
TIMESTAMP_NAME = 'timestamp.txt'

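bds = []  # list to collect data from 'body.txt' files
ts = []   # list to collect data from 'timestamp.txt' files
for parent_dir, folders, files in os.walk(BASEPATH):
    print(parent_dir, folders, files)  # show progress while walking the tree
    try:
        body = pd.read_csv(os.path.join(parent_dir, BODY_NAME), header=None, names=['body'])
        timestamp = pd.read_csv(
            os.path.join(parent_dir, TIMESTAMP_NAME), header=None, names=['timestamp'],
            converters={'timestamp': lambda x: pd.to_datetime(float(x), unit='s')}
        )
    except FileNotFoundError:
        continue  # skip folders without one of the two files
    # append only after both reads succeed, so the two lists stay aligned
    bds.append(body)
    ts.append(timestamp)
# ignore_index avoids duplicate row labels before joining the two columns side by side
data = pd.concat(
    (pd.concat(ts, ignore_index=True), pd.concat(bds, ignore_index=True)), axis=1
).sort_values('timestamp').reset_index(drop=True)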

Running the previous code, data is something like this:

                       timestamp body
0  2020-12-26 17:44:21.419178963    b
1  2020-12-26 17:44:21.419178963    c
2  2020-12-26 17:44:21.421484947    f
3  2020-12-26 17:44:21.421484947    e
4  2020-12-26 17:44:21.421484947    j
5  2020-12-26 17:44:22.419178963    f
6  2020-12-26 17:44:22.419178963    g
7  2020-12-26 17:44:22.421484947    f
8  2020-12-26 17:44:22.421484947    f
9  2020-12-26 17:44:22.421484947    i
10 2020-12-26 17:44:23.419178963    k
11 2020-12-26 17:44:23.419178963    b
12 2020-12-26 17:44:23.421484947    b
13 2020-12-26 17:44:23.421484947    d
14 2020-12-26 17:44:23.421484947    i
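
If all you need is the first step from the question, a list of every subdirectory under basepath, os.walk alone is enough. The following is a minimal sketch: list_subdirectories is a hypothetical helper name, and the folder names are made up to mirror the layout in the question.

```python
import os
import tempfile

def list_subdirectories(basepath):
    """Return every directory below basepath, relative to it, sorted."""
    found = []
    for parent_dir, folders, files in os.walk(basepath):
        for folder in folders:
            full = os.path.join(parent_dir, folder)
            found.append(os.path.relpath(full, basepath))
    return sorted(found)

# Build a tiny throwaway tree that mirrors the question's layout.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, 'abcd', 'aba', 'abb'))
    os.makedirs(os.path.join(base, 'abcd', 'abd', 'add'))
    subdirs = list_subdirectories(base)
    print(subdirs)
```

The paths are returned relative to basepath, so join them with basepath again before opening body.txt or timestamp.txt.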

For the second part of your question, you can collect data in an interval of two seconds in different DataFrames using resample (after making timestamp the index of the DataFrame):

data.set_index('timestamp', drop=True, inplace=True)
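# origin='start' anchors the first bin at the first timestamp (needs pandas >= 1.1)
resampled = data.resample('2s', origin='start')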
df_2sec = {}
for i, (timestamp, df) in enumerate(resampled):
    df_2sec[i] = df

Here are the first two of these DataFrames (df_2sec[0] and df_2sec[1]):

                              body
timestamp                         
2020-12-26 17:44:21.419178963    b
2020-12-26 17:44:21.419178963    c
2020-12-26 17:44:21.421484947    f
2020-12-26 17:44:21.421484947    e
2020-12-26 17:44:21.421484947    j
2020-12-26 17:44:22.419178963    f
2020-12-26 17:44:22.419178963    g
2020-12-26 17:44:22.421484947    f
2020-12-26 17:44:22.421484947    f
2020-12-26 17:44:22.421484947    i

                              body
timestamp                         
2020-12-26 17:44:23.419178963    k
2020-12-26 17:44:23.419178963    b
2020-12-26 17:44:23.421484947    b
2020-12-26 17:44:23.421484947    d
2020-12-26 17:44:23.421484947    i
2020-12-26 17:44:24.419178963    g
2020-12-26 17:44:24.419178963    d
2020-12-26 17:44:24.421484947    i
2020-12-26 17:44:24.421484947    e
2020-12-26 17:44:24.421484947    i

You can change where the resampling bins start with the origin argument of resample. I anchored them at the first timestamp in the DataFrame (origin='start'); take a look at the pandas documentation for the other options.
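
To illustrate what origin changes, here is a small sketch on made-up timestamps (the demo frame and its values are assumptions, not data from the question). It compares the default bin edges with those produced by origin='start', which is available from pandas 1.1 on.

```python
import pandas as pd

# Made-up timestamps, one second apart, starting on an odd second.
index = pd.to_datetime(['2020-12-26 17:44:21', '2020-12-26 17:44:22',
                        '2020-12-26 17:44:23', '2020-12-26 17:44:24'])
demo = pd.DataFrame({'body': list('abcd')}, index=index)

# Default origin ('start_day'): bins are aligned to midnight,
# so with a 2-second frequency the edges fall on even seconds.
default_bins = [str(ts.time()) for ts, _ in demo.resample('2s')]

# origin='start': the first bin opens at the first timestamp in the data.
start_bins = [str(ts.time()) for ts, _ in demo.resample('2s', origin='start')]

print(default_bins)  # ['17:44:20', '17:44:22', '17:44:24']
print(start_bins)    # ['17:44:21', '17:44:23']
```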

If you prefer, you can convert each DataFrame to a dict, changing df_2sec[i] = df to df_2sec[i] = df.to_dict().
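
Since the title asks for a dict, here is an alternative sketch that goes from the table to a dict directly. It assumes the timestamps are plain integers of seconds, as in the sample table of the question, and groups them with integer division instead of resample. INTERVAL, bucket_start and bodies_by_bucket are illustrative names.

```python
import pandas as pd

# The sample table from the question: integer seconds and one body per row.
data = pd.DataFrame({
    'timestamp': [300, 301, 304, 306, 301, 304, 301, 306, 308, 307],
    'body': list('abcdefghij'),
})

INTERVAL = 2  # seconds per bucket; use 3600 for one-hour buckets

# Integer division maps each timestamp to the start of its bucket.
bucket_start = (data['timestamp'] // INTERVAL) * INTERVAL

# Collect the bodies of each bucket into a plain dict of lists.
bodies_by_bucket = data.groupby(bucket_start)['body'].apply(list).to_dict()
print(bodies_by_bucket)
# {300: ['a', 'b', 'e', 'g'], 304: ['c', 'f'], 306: ['d', 'h', 'j'], 308: ['i']}
```

Note that with fixed 2-second buckets starting at 300, the body at 307 lands in the 306 bucket, which differs slightly from the hand-made split in the question.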

NOTE: I used the following code to create the structure of your folder, in case you want to reproduce data from my answer.

import os
import numpy as np
import datetime
import pandas as pd


BASEPATH = 'your base path'
BODY_NAME = 'body.txt'
TIMESTAMP_NAME = 'timestamp.txt'

BODIES = 'abcdefghijk'
FOLDER = 'abcdefghijklmnopqrstuvwxyz'

def create_rnd_folder(basepath, level=2, timestamp=None):
    """Generate two levels of random directories with sample files in the leaves"""
    if level == 0:
        # leaf folder: write one timestamp and one body per line
        n_samples = np.random.randint(1, 100)
        with open(os.path.join(basepath, TIMESTAMP_NAME), 'w') as f:
            for i in range(n_samples):
                f.write(f'{timestamp + i}\n')

        with open(os.path.join(basepath, BODY_NAME), 'w') as f:
            for s in np.random.choice(list(BODIES), size=n_samples):
                f.write(f'{s}\n')
    else:
        new_folders = [np.random.choice(list(FOLDER), size=3 + level) for _ in range(np.random.randint(2, 4))]
        for nf in map(''.join, new_folders):
            new_folder = os.path.join(basepath, nf)
            os.makedirs(new_folder, exist_ok=True)
            # recurse one level down; the start timestamp is fixed at the top level
            create_rnd_folder(
                new_folder, level=level - 1,
                # timezone-aware "now" so that .timestamp() is a true UNIX timestamp
                timestamp=(datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)).timestamp() if level == 2 else timestamp
            )


os.makedirs(BASEPATH, exist_ok=False)
create_rnd_folder(BASEPATH, level=2, timestamp=None)
