Python: how to convert elements of a list of lists

Posted 2019-08-04 06:04

Question:

I have a program which retrieves a list of PubMed publications and wish to build a graph of co-authorship, meaning that for each article I want to add each author (if not already present) as a vertex and add an undirected edge (or increase its weight) between every pair of coauthors.

I managed to write the first part of the program, which retrieves the list of authors for each publication, and I understand I could use the NetworkX library to build the graph (and then export it to GraphML for Gephi), but I cannot wrap my head around how to transform the "list of lists" into a graph.

Here follows my code. Thank you very much.

### if needed install the required modules
### python3 -m pip install biopython
### python3 -m pip install numpy

from Bio import Entrez
from Bio import Medline
Entrez.email = "rja@it.com"
handle = Entrez.esearch(db="pubmed", term='("lung diseases, interstitial"[MeSH Terms] NOT "pneumoconiosis"[MeSH Terms]) AND "artificial intelligence"[MeSH Terms] AND "humans"[MeSH Terms]', retmax="1000", sort="relevance", retmode="xml")
records = Entrez.read(handle)
ids = records['IdList']
h = Entrez.efetch(db='pubmed', id=ids, rettype='medline', retmode='text')
#now h holds all of the articles and their sections
records = Medline.parse(h)
# initialize an empty vector for the authors
authors = []
# iterate through all articles
for record in records:
    #for each article (record) get the authors list
    au = record.get('AU', '?')
    # now from the author list iterate through each author
    for a in au: 
        if a not in authors:
            authors.append(a)
    # the following just shows the list of all unique authors,
    # sorted alphabetically (these should become my graph nodes)
    authors.sort()
    print('Authors: {0}'.format(', '.join(authors)))

Answer 1:

Cool, the code runs, so the data structures are clear! As an approach, we build the connectivity matrix for both articles/authors and authors/co-authors.

List of authors: if you want to describe the relation between the articles and the authors, I think you need the author list of each article.

authors = []
author_lists = []              # <--- new
for record in records:
    au = record.get('AU', [])  # empty list when an article has no authors
    author_lists.append(au)    # <--- new
    for a in au: 
        if a not in authors: authors.append(a)
authors.sort()
print(authors)
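
As a side note, the unique, sorted author list can also be derived from the per-article lists with a set, which avoids the linear `a not in authors` lookup. This is only a sketch; the sample names below are made up for illustration and do not come from PubMed.

```python
# Hypothetical sample data standing in for the lists collected from the records
author_lists = [["Smith J", "Lee K"], ["Lee K", "Rossi M"], []]

# Flatten all lists into a set (drops duplicates), then sort alphabetically
authors = sorted({a for au in author_lists for a in au})
print(authors)
```

An article without authors simply contributes nothing to the set.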

NumPy, pandas and matplotlib: this is just the way I am used to working.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

AU = np.array(authors)        # authors as np-array
NA = AU.shape[0]              # number of authors

NL = len(author_lists)        # number of articles/author lists
AUL = author_lists            # author lists, kept as a plain list (their lengths differ)

print('NA, NL', NA,NL)

Connectivity articles/authors

CON = np.zeros((NL,NA),dtype=int) # initializes connectivity matrix
for j in range(NL):               # run through the article's author list 
    aul = np.array(AUL[j])        # get a single author list as np-array
    z = np.zeros((NA),dtype=int)
    for k in range(len(aul)):     # get a single author
        z += (AU==aul[k])         # get its position in AU and add it up
    CON[j,:] = z                  # insert the result in the connectivity matrix

#---- graphics --------
fig = plt.figure(figsize=(20,10))
plt.spy(CON, marker='s', color='chartreuse', markersize=5)
plt.xlabel('Authors')
plt.ylabel('Articles')
plt.title('Authors of the articles', fontweight='bold')
plt.show()
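
The article/author matrix can also be filled by looking up each author's column in a dictionary instead of comparing against the whole author array. The following is a minimal sketch under the assumption that a 0/1 entry per article is enough (unlike the loop above, a name listed twice in one article is counted once); the sample data and the helper name `col` are hypothetical.

```python
import numpy as np

# Hypothetical sample data: sorted unique authors and one author list per article
authors = ["Chen L", "Lee K", "Rossi M", "Smith J"]
author_lists = [["Smith J", "Lee K"], ["Lee K", "Rossi M"], ["Chen L"]]

col = {name: i for i, name in enumerate(authors)}   # author name -> column index

CON = np.zeros((len(author_lists), len(authors)), dtype=int)
for j, au in enumerate(author_lists):   # one row per article
    for a in set(au):                   # set(): ignore repeated names in one article
        CON[j, col[a]] = 1
print(CON)
```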

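Connectivity authors/co-authors: the resulting matrix is symmetric. Entry (j, k) counts the articles that authors j and k share, and the diagonal holds each author's own article count.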

df = pd.DataFrame(CON)          # let's use pandas for the following step
ACON = np.zeros((NA,NA))         # initialize the connectivity matrix
for j in range(NA):              # run through the authors
    df_a = df[df.iloc[:, j] >0]  # give all rows with author j involved
    w = np.array(df_a.sum())     # sum the rows, store it in np-array 
    ACON[j] = w                  # insert it in the connectivity matrix

#---- graphics --------
fig = plt.figure(figsize=(10,10))
plt.spy(ACON, marker='s', color='chartreuse', markersize=3)
plt.xlabel('Authors')
plt.ylabel('Authors')
plt.title('Authors that are co-authors', fontweight='bold')
plt.show()
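
If the article/author matrix contains only 0 and 1, the same author/co-author matrix should also be obtainable as a single matrix product, since entry (j, k) of `CON.T @ CON` sums, over all articles, the product of the two authors' columns. A small sketch with a made-up 3x4 matrix (the name `W` for the edge weights is an assumption of this example):

```python
import numpy as np

# Hypothetical article/author matrix: 3 articles (rows), 4 authors (columns)
CON = np.array([[0, 1, 0, 1],
                [0, 1, 1, 0],
                [1, 0, 0, 0]])

ACON = CON.T @ CON          # (j, k): number of articles shared by authors j and k

W = ACON.copy()
np.fill_diagonal(W, 0)      # drop the diagonal (each author's own article count)
print(ACON)
print(W)
```

With the diagonal removed, `W` can be read as the weights of the undirected co-author edges.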

For the graphics with NetworkX, I think you need a clear idea of what you want to represent, because there are many points and many possibilities too (perhaps you could post an example?). Only a few author circles are plotted below.

import networkx as nx

def set_edges(idx):
    # pair each node index with its cyclic predecessor, so the nodes form a ring
    idx_prev = np.roll(idx, shift=1)
    return np.vstack((idx, idx_prev)).T

Q = nx.Graph()                            # a new graph is already empty

AT = np.triu(ACON)                        # only the upper triangle is needed (ACON is symmetric)
fig = plt.figure(figsize=(7,7))
for k in range(min(9, NA)):               # only the first few authors
    iA = np.argwhere(AT[k]>0).ravel()     # get the indices with AT[k]>0
    Edges = set_edges(iA)                 # select the involved nodes and set the edges
    Q.add_edges_from(Edges)
nx.draw(Q, alpha=0.5, with_labels=True)   # with_labels is a drawing option, not an edge attribute
plt.title('Co-author-ship', fontweight='bold')
plt.show()
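
For the graph described in the question (each author a vertex, an undirected edge between every pair of coauthors, the weight increased for each shared article), the list of lists can also be fed to NetworkX directly, without going through the matrices. This is a minimal sketch, not a definitive implementation: the function name `build_coauthor_graph` and the sample names are hypothetical, and it assumes that identical name strings refer to the same person.

```python
from itertools import combinations

import networkx as nx


def build_coauthor_graph(author_lists):
    """Build an undirected graph: one node per author, weighted co-author edges."""
    G = nx.Graph()
    for au in author_lists:
        unique = sorted(set(au))        # ignore a name repeated within one article
        G.add_nodes_from(unique)        # keeps single-author articles as isolated nodes
        for a, b in combinations(unique, 2):   # every pair of coauthors
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1  # one more shared article
            else:
                G.add_edge(a, b, weight=1)
    return G


# Hypothetical sample data in place of the PubMed author lists
sample_lists = [["Smith J", "Lee K", "Rossi M"], ["Smith J", "Lee K"], ["Chen L"]]
G = build_coauthor_graph(sample_lists)
print(G.number_of_nodes(), G.number_of_edges())
print(G["Smith J"]["Lee K"]["weight"])
```

The resulting graph can then be exported for Gephi with `nx.write_graphml(G, "coauthors.graphml")`, where the file name is only an example.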