Home » Python with Pandas Case 1: Merge CSV files, grouping by columns and sum values

Python with Pandas Case 1: Merge CSV files, grouping by columns and sum values

Sample Data 1 (node1)

version=9.0
date,instance,MSU's transmitted,MSU's received,MSU's octets transmitted,MSU's octets received,MSUs received requiring GTT,MSU octets received for MSUs requiring GTT,Duration of link set inactivity
2020-10-26-00:05,LKSET_1_TEAS-TEAS,528,570,11000,26459,0,0,0
2020-10-26-00:05,LKSET_5_TIHO-TIHO,18,10,375,228,0,0,0
2020-10-26-00:05,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,202,179,23462,16303,159,15883,0
2020-10-26-00:05,LKSET_1_LISTP6017NLS-LISTP6017NLS,21495,19685,2183024,3279609,19675,3279259,0
2020-10-26-00:05,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:05,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,586,511,16769,12248,0,0,0
2020-10-26-00:10,LKSET_1_TEAS-TEAS,337,410,6623,19562,0,0,0
2020-10-26-00:10,LKSET_5_TIHO-TIHO,18,10,369,225,0,0,0
2020-10-26-00:10,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,187,191,21560,16699,170,16261,0
2020-10-26-00:10,LKSET_1_LISTP6017NLS-LISTP6017NLS,21669,19157,2185071,3187426,19147,3187076,0
2020-10-26-00:10,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:10,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,349,289,11126,6949,0,0,0
2020-10-26-00:15,LKSET_1_TEAS-TEAS,324,337,6438,17652,0,0,0
2020-10-26-00:15,LKSET_5_TIHO-TIHO,30,10,600,211,0,0,0
2020-10-26-00:15,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,142,143,15848,12923,123,12503,0
2020-10-26-00:15,LKSET_1_LISTP6017NLS-LISTP6017NLS,20341,18522,2053465,3086120,18512,3085770,0
2020-10-26-00:15,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:15,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,247,295,7634,7535,0,0,0

Sample Data 2 (node2)

version=9.0
date,instance,MSU's transmitted,MSU's received,MSU's octets transmitted,MSU's octets received,MSUs received requiring GTT,MSU octets received for MSUs requiring GTT,Duration of link set inactivity
2020-10-26-00:05,LKSET_1_TEAS-TEAS,978,440,40479,21912,0,0,0
2020-10-26-00:05,LKSET_5_TIHO-TIHO,25,31,510,757,0,0,0
2020-10-26-00:05,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,186,184,21524,16797,164,16377,0
2020-10-26-00:05,LKSET_1_LISTP6017NLS-LISTP6017NLS,21618,22411,2184153,3382136,19615,3269682,0
2020-10-26-00:05,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:05,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,542,583,14488,13280,0,0,0
2020-10-26-00:10,LKSET_1_TEAS-TEAS,753,337,31402,17535,0,0,0
2020-10-26-00:10,LKSET_5_TIHO-TIHO,23,30,478,821,0,0,0
2020-10-26-00:10,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,200,176,22367,15239,156,14819,0
2020-10-26-00:10,LKSET_1_LISTP6017NLS-LISTP6017NLS,21613,22072,2189857,3325865,19293,3213985,0
2020-10-26-00:10,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:10,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,331,446,9410,10521,0,0,0
2020-10-26-00:15,LKSET_1_TEAS-TEAS,756,357,32251,17968,0,0,0
2020-10-26-00:15,LKSET_5_TIHO-TIHO,27,48,561,1910,0,0,0
2020-10-26-00:15,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,161,146,17802,12141,125,11703,0
2020-10-26-00:15,LKSET_1_LISTP6017NLS-LISTP6017NLS,19821,20965,2022165,3179670,18516,3080600,0
2020-10-26-00:15,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,20,20,700,700,0,0,0
2020-10-26-00:15,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,304,419,8419,10012,0,0,0

We want a result file to sum the counters of each instance (i.e. link)

Output

date,instance,MSU's transmitted,MSU's received,MSU's octets transmitted,MSU's octets received,MSUs received requiring GTT,MSU octets received for MSUs requiring GTT,Duration of link set inactivity
2020-10-26-00:05,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,1128,1094,31257,25528,0,0,0
2020-10-26-00:05,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,388,363,44986,33100,323,32260,0
2020-10-26-00:05,LKSET_1_LISTP6017NLS-LISTP6017NLS,43113,42096,4367177,6661745,39290,6548941,0
2020-10-26-00:05,LKSET_1_TEAS-TEAS,1506,1010,51479,48371,0,0,0
2020-10-26-00:05,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,40,40,1400,1400,0,0,0
2020-10-26-00:05,LKSET_5_TIHO-TIHO,43,41,885,985,0,0,0
2020-10-26-00:10,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,680,735,20536,17470,0,0,0
2020-10-26-00:10,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,387,367,43927,31938,326,31080,0
2020-10-26-00:10,LKSET_1_LISTP6017NLS-LISTP6017NLS,43282,41229,4374928,6513291,38440,6401061,0
2020-10-26-00:10,LKSET_1_TEAS-TEAS,1090,747,38025,37097,0,0,0
2020-10-26-00:10,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,40,40,1400,1400,0,0,0
2020-10-26-00:10,LKSET_5_TIHO-TIHO,41,40,847,1046,0,0,0
2020-10-26-00:15,LKSET_1_BOMAR2675NLS-BOMAR2675NLS,551,714,16053,17547,0,0,0
2020-10-26-00:15,LKSET_1_BOOKSCP92NLS-BOOKSCP92NLS,303,289,33650,25064,248,24206,0
2020-10-26-00:15,LKSET_1_LISTP6017NLS-LISTP6017NLS,40162,39487,4075630,6265790,37028,6166370,0
2020-10-26-00:15,LKSET_1_TEAS-TEAS,1080,694,38689,35620,0,0,0
2020-10-26-00:15,LKSET_4_PAXMAR01NLS-PAXMAR01NLS,40,40,1400,1400,0,0,0
2020-10-26-00:15,LKSET_5_TIHO-TIHO,57,58,1161,2121,0,0,0

Let’s go through the code

# import pandas (aliased as 'pd'), glob and os libraries to make it available for use
import pandas as pd
import glob
import os
# glob to retrieve files in the data directory of the nodes
list_of_files = glob.glob('/data/SUBOK/sample_data_1/*.csv')
list_of_files = glob.glob('/data/SUBOK/sample_data_2/*.csv')
# data directory contains daily files and we are only interested with latest file
latest_file1 = max(list_of_files, key=os.path.getctime)
latest_file2 = max(list_of_files, key=os.path.getctime)
# skiprows=1 : in the sample data we want to drop row=0 and keep row=1 as header
# pd.read_csv : read the file (i.e. latest file for this example
# pd.concat to concatenate the (latest) files
df = pd.concat([pd.read_csv(latest_file1,skiprows=1),pd.read_csv(latest_file2,skiprows=1)])
# df.groupby : group values in the dataset based on the column selected
# sum() : sum up the counters in the group
result=df.groupby(['date','instance']).sum()
# direct the output to a result file
result.to_csv('/data/SUBOK/results/result_linkset.csv')

Leave a Reply

Your email address will not be published. Required fields are marked *